multimodal foundation model
Biomedical Visual Instruction Tuning with Clinician Preference Alignment
Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation. Then, during the selection phase, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (18.5% relatively) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at https://BioMed-VITAL.github.io.
HEMM: Holistic Evaluation of Multimodal Foundation Models
Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge.
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Westerhoff, Justus, Purelku, Erblina, Hackstein, Jakob, Loos, Jonas, Pinetzki, Leo, Rodner, Erik, Hufe, Lorenz
Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. Existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing 1162 images across hundreds of object categories and attack words. Through extensive benchmarking of Vision-Language Models on SCAM, we demonstrate that typographic attacks significantly degrade performance, and identify that training data and model architecture influence the susceptibility to these attacks. Our findings indicate that typographic attacks remain effective against state-of-the-art Large Vision-Language Models, especially those employing vision encoders inherently vulnerable to such attacks. However, employing larger Large Language Model backbones reduces this vulnerability while simultaneously enhancing typographic understanding. Additionally, we demonstrate that synthetic attacks closely resemble real-world (handwritten) attacks, validating their use in research. Our work provides a comprehensive resource and empirical insights to facilitate future research toward robust and trustworthy multimodal AI systems. Finally, we publicly release the datasets introduced in this paper, along with the code for evaluations under www.bliss.berlin/research/scam.
- North America > United States (0.04)
- Europe > Germany > Berlin (0.04)
- Asia > China (0.04)
- Transportation > Ground > Road (1.00)
- Health & Medicine (1.00)
- Consumer Products & Services (1.00)
- (3 more...)
Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning
Polat, Can, Kurban, Hasan, Serpedin, Erchin, Kurban, Mustafa
Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision--language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
- North America > United States > Texas > Brazos County > College Station (0.14)
- Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
The Strategic Imperative for Healthcare Organizations to Build Proprietary Foundation Models
This paper presents a comprehensive analysis of the strategic imperative for healthcare organizations to develop proprietary foundation models rather than relying exclusively on commercial alternatives. We examine four fundamental considerations driving this imperative: the domain-specific requirements of healthcare data representation, critical data sovereignty and governance considerations unique to healthcare, strategic competitive advantages afforded by proprietary AI infrastructure, and the transformative potential of healthcare-specific foundation models for patient care and organizational operations. Through analysis of empirical evidence, economic frameworks, and organizational case studies, we demonstrate that proprietary multimodal foundation models enable healthcare organizations to achieve superior clinical performance, maintain robust data governance, create sustainable competitive advantages, and accelerate innovation pathways. While acknowledging implementation challenges, we present evidence showing organizations with proprietary AI capabilities demonstrate measurably improved outcomes, faster innovation cycles, and stronger strategic positioning in the evolving healthcare ecosystem. This analysis provides healthcare leaders with a comprehensive framework for evaluating build-versus-buy decisions regarding foundation model implementation, positioning proprietary foundation model development as a cornerstone capability for forward-thinking healthcare organizations.
- Europe (0.28)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Pennsylvania (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks
Matsuishi, Koki, Ukita, Kosuke, Okita, Tsuyoshi
In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using IMU. While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis, in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short in providing a detailed analysis of full-body human activity. To address this limitation, we propose Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity, which first-person perspectives alone fail to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model's overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in the zero-shot classification for action recognition, our method achieved significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961.
- Health & Medicine (0.66)
- Information Technology (0.48)
- Leisure & Entertainment (0.47)
WONDERBREAD: A Benchmark for Evaluating Multimodal Foundation Models on Business Process Management Tasks
Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task -- full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores the reality of how most BPM tools are applied today -- simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation.
Biomedical Visual Instruction Tuning with Clinician Preference Alignment
Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation.
HEMM: Holistic Evaluation of Multimodal Foundation Models
Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today's models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction tuning objectives) influence performance.
MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models
Xu, Chejian, Zhang, Jiawei, Chen, Zhaorun, Xie, Chulin, Kang, Mintong, Potter, Yujin, Wang, Zhun, Yuan, Zhuowen, Xiong, Alexander, Xiong, Zidi, Zhang, Chenhui, Yuan, Lingzhi, Zeng, Yi, Xu, Peiyang, Guo, Chengquan, Zhou, Andy, Tan, Jeffrey Ziwei, Zhao, Xuandong, Pinto, Francesco, Xiang, Zhen, Gai, Yu, Lin, Zinan, Hendrycks, Dan, Li, Bo, Song, Dawn
Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.
- Europe > Switzerland > Zürich > Zürich (0.13)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Oceania (0.04)
- (6 more...)
- Transportation > Ground > Road (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (4 more...)